{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### VectorSlicer\n",
    "\n",
    "   VectorSlicer是一个转换器输入特征向量，输出原始特征向量子集。VectorSlicer接收带有特定索引的向量列，通过对这些索引的值进行筛选得到新的向量集。可接受如下两种索引\n",
    "\n",
    "1.整数索引，setIndices()。\n",
    "\n",
    "2.字符串索引代表向量中特征的名字，此类要求向量列有AttributeGroup，因为该工具根据Attribute来匹配名字字段。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pyspark.ml.feature import VectorSlicer\n",
    "from pyspark.ml.linalg import Vectors\n",
    "from pyspark.sql.types import Row\n",
    " \n",
    "df = spark.createDataFrame([\n",
    "    Row(userFeatures=Vectors.sparse(3, {0: -2.0, 1: 2.3}),),\n",
    "    Row(userFeatures=Vectors.dense([-2.0, 2.3, 0.0]),)])\n",
    " \n",
    "slicer = VectorSlicer(inputCol=\"userFeatures\", outputCol=\"features\", indices=[1])\n",
    " \n",
    "output = slicer.transform(df)\n",
    " \n",
    "output.select(\"userFeatures\", \"features\").show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###  RFormula\n",
    "\n",
    "     RFormula通过R模型公式来选择列。支持R操作中的部分操作，包括‘~’, ‘.’, ‘:’, ‘+’以及‘-‘，基本操作如下：\n",
    "\n",
    "1. ~分隔目标和对象\n",
    "\n",
    "2. +合并对象，“+ 0”意味着删除空格\n",
    "\n",
    "3. :交互（数值相乘，类别二值化）\n",
    "\n",
    "4. . 除了目标外的全部列"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pyspark.ml.feature import RFormula\n",
    " \n",
    "dataset = spark.createDataFrame(\n",
    "    [(7, \"US\", 18, 1.0),\n",
    "     (8, \"CA\", 12, 0.0),\n",
    "     (9, \"NZ\", 15, 0.0)],\n",
    "    [\"id\", \"country\", \"hour\", \"clicked\"])\n",
    "formula = RFormula(\n",
    "    formula=\"clicked ~ country + hour\",\n",
    "    featuresCol=\"features\",\n",
    "    labelCol=\"label\")\n",
    "output = formula.fit(dataset).transform(dataset)\n",
    "output.select(\"features\", \"label\").show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### ChiSqSelector\n",
    "\n",
    "  ChiSqSelector代表卡方特征选择。它适用于带有类别特征的标签数据。ChiSqSelector根据类别的独立卡方2检验来对特征排序，然后选取类别标签主要依赖的特征。它类似于选取最有预测能力的特征。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pyspark.ml.feature import ChiSqSelector\n",
    "from pyspark.ml.linalg import Vectors\n",
    " \n",
    "df = spark.createDataFrame([\n",
    "    (7, Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0,),\n",
    "    (8, Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0,),\n",
    "    (9, Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0,)], [\"id\", \"features\", \"clicked\"])\n",
    " \n",
    "selector = ChiSqSelector(numTopFeatures=1, featuresCol=\"features\",\n",
    "                         outputCol=\"selectedFeatures\", labelCol=\"clicked\")\n",
    " \n",
    "result = selector.fit(df).transform(df)\n",
    "result.show()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
